New York City flights

Guiding solution. An analysis of the nycflights13 datasets. Mandatory project report in Tools for Analytics (R part).

Lars Relund Nielsen
2021-09-23

Introduction

We consider the datasets available from the package nycflights13, which contain information about every flight that departed from New York City in 2013. Let us have a look at the datasets. First, we load the packages needed for this report:
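
The chunk that loads the packages is not shown in this rendering. A minimal sketch, assuming the package list can be inferred from the functions used later in the report (skim from skimr, paged_table from rmarkdown, geom_dl from directlabels, hm from lubridate, and the p1 + p2 plot composition from patchwork), could be:

```r
# Assumed package list, inferred from the functions used in this report
pkgs <- c("tidyverse", "nycflights13", "skimr", "rmarkdown",
          "directlabels", "lubridate", "patchwork")

# Report packages that are not installed instead of failing half-way
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (p in setdiff(pkgs, missing_pkgs)) {
  library(p, character.only = TRUE)
}
```

The names pkgs and missing_pkgs are only illustrative; a plain sequence of library() calls works just as well.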

The datasets in the nycflights13 package are:

Dataset   Description
airlines  Airline names
airports  Airport metadata
flights   Flights data
planes    Plane metadata
weather   Hourly weather data

Let us try to do some descriptive analytics on the different datasets.

Flights

In this section we will focus on the flights data set, which lists all domestic flights out of the New York area in 2013. We run skim to get an overview:

skim(flights)
Table 1: Data summary
Name flights
Number of rows 336776
Number of columns 19
_______________________
Column type frequency:
character 4
numeric 14
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
carrier 0 1.00 2 2 0 16 0
tailnum 2512 0.99 5 6 0 4043 0
origin 0 1.00 3 3 0 3 0
dest 0 1.00 3 3 0 105 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2013.00 0.00 2013 2013 2013 2013 2013 ▁▁▇▁▁
month 0 1.00 6.55 3.41 1 4 7 10 12 ▇▆▆▆▇
day 0 1.00 15.71 8.77 1 8 16 23 31 ▇▇▇▇▆
dep_time 8255 0.98 1349.11 488.28 1 907 1401 1744 2400 ▁▇▆▇▃
sched_dep_time 0 1.00 1344.25 467.34 106 906 1359 1729 2359 ▁▇▇▇▃
dep_delay 8255 0.98 12.64 40.21 -43 -5 -2 11 1301 ▇▁▁▁▁
arr_time 8713 0.97 1502.05 533.26 1 1104 1535 1940 2400 ▁▃▇▇▇
sched_arr_time 0 1.00 1536.38 497.46 1 1124 1556 1945 2359 ▁▃▇▇▇
arr_delay 9430 0.97 6.90 44.63 -86 -17 -5 14 1272 ▇▁▁▁▁
flight 0 1.00 1971.92 1632.47 1 553 1496 3465 8500 ▇▃▃▁▁
air_time 9430 0.97 150.69 93.69 20 82 129 192 695 ▇▂▂▁▁
distance 0 1.00 1039.91 733.23 17 502 872 1389 4983 ▇▃▂▁▁
hour 0 1.00 13.18 4.66 1 9 13 17 23 ▁▇▇▇▅
minute 0 1.00 26.23 19.30 0 8 29 44 59 ▇▃▆▃▅

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
time_hour 0 1 2013-01-01 05:00:00 2013-12-31 23:00:00 2013-07-03 10:00:00 6936

The variables in this dataset are summarized in the skim output above. For further details about the dataset see ?flights or the online documentation.

The skim output indicates that some flights were canceled: dep_time is missing for 8255 rows. We remove these observations from the dataset:

dat <- flights %>%
  filter(!is.na(dep_time))

Joining datasets

Let us first try to do some mutating joins and combine variables from multiple tables. In flights we have flight information with an abbreviation for the carrier (carrier), and in airlines we have a mapping between abbreviations and full names (name). We can use a join to add the carrier names to the flight data:

dat <- dat %>% 
  left_join(airlines) %>% 
  rename(carrier_name = name) %>% 
  print()
# A tibble: 328,521 × 20
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 328,511 more rows, and 13 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, carrier_name <chr>

Note that we join by the column carrier, which is present in both data frames. Since no by argument is given, left_join uses the columns common to both data frames, which is equivalent to by = c("carrier" = "carrier") here. If we want the full name of the origin airport, we need to specify which column to join on, since each flight has both an origin and a destination airport. Afterwards we do the same for the destination airport.

dat <- dat %>% 
  left_join(airports %>% select(faa, name), 
            by = c("origin" = "faa")) %>% 
  rename(origin_name = name) %>% 
  left_join(airports %>% select(faa, name), 
            by = c("dest" = "faa")) %>% 
  rename(dest_name = name) %>% 
  select(month, carrier_name, origin_name, dest_name, sched_dep_time, dep_delay, arr_delay, distance, tailnum) %>% 
  print()
# A tibble: 328,521 × 9
   month carrier_name  origin_name dest_name  sched_dep_time dep_delay
   <int> <chr>         <chr>       <chr>               <int>     <dbl>
 1     1 United Air L… Newark Lib… George Bu…            515         2
 2     1 United Air L… La Guardia  George Bu…            529         4
 3     1 American Air… John F Ken… Miami Intl            540         2
 4     1 JetBlue Airw… John F Ken… <NA>                  545        -1
 5     1 Delta Air Li… La Guardia  Hartsfiel…            600        -6
 6     1 United Air L… Newark Lib… Chicago O…            558        -4
 7     1 JetBlue Airw… Newark Lib… Fort Laud…            600        -5
 8     1 ExpressJet A… La Guardia  Washingto…            600        -3
 9     1 JetBlue Airw… John F Ken… Orlando I…            600        -3
10     1 American Air… La Guardia  Chicago O…            600        -2
# … with 328,511 more rows, and 3 more variables: arr_delay <dbl>,
#   distance <dbl>, tailnum <chr>
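
One destination name in the output above is shown as <NA>. A left join keeps every row of the left table and fills in NA where the right table has no match. The sketch below illustrates this on toy data; the tables toy_flights and toy_airports and the code "XXX" are made up for the illustration and are not part of nycflights13:

```r
library(dplyr)

# Toy stand-ins for flights and airports (values are illustrative only)
toy_flights <- tibble(
  origin = c("JFK", "LGA"),
  dest   = c("MIA", "XXX")   # "XXX" is a made-up code with no match
)
toy_airports <- tibble(
  faa  = c("JFK", "LGA", "MIA"),
  name = c("John F Kennedy Intl", "La Guardia", "Miami Intl")
)

# Join twice on different key columns, renaming after each join
toy_joined <- toy_flights %>%
  left_join(toy_airports, by = c("origin" = "faa")) %>%
  rename(origin_name = name) %>%
  left_join(toy_airports, by = c("dest" = "faa")) %>%
  rename(dest_name = name)

toy_joined

# Rows without a match are kept; their new column is NA
filter(toy_joined, is.na(dest_name))
```

Running the last filter on dat instead of the toy table would list the flights whose destination code could not be matched.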

We now have the flights data we need stored in the data frame dat. Let us try to answer some questions.

How many flights leave each New York airport for each carrier?

We first calculate a summary table:

dat %>% 
  count(origin_name, carrier_name, sort = TRUE) %>% 
  paged_table()

Let us visualize the numbers. First we facet by airport and use geom_bar:

dat %>% 
  ggplot(aes(carrier_name)) +
  geom_bar() + 
  facet_grid(rows = vars(origin_name)) + 
  labs(
    title = "Number of flights",
    x = "Carrier",
    y = "Flights"
  ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

We can also compare the two categorical variables by using geom_count:

dat %>%
  ggplot(aes(origin_name, carrier_name)) + 
  geom_count() + 
  labs(
    title = "Number of flights",
    y = "Carrier",
    x = "Departure airport",
    size = "Flights"
  ) 

Finally, we can use a heatmap by using geom_tile. In this case, geom_tile doesn’t offer a way to calculate counts on its own, so we use the function count in our pipe:

dat %>%
  count(origin_name, carrier_name) %>%
  ggplot(aes(origin_name, carrier_name, fill = n)) + 
  geom_tile() + 
  labs(
    title = "Number of flights",
    y = "Carrier",
    x = "Departure airport",
    fill = "Flights"
  ) 

Carrier flights per month

The number of flights per month and carrier is:

dat %>%
  count(month, carrier_name, sort = TRUE) %>%
  paged_table()

We will try to visualize the numbers using a line plot with carrier as color aesthetic:

dat %>%
  count(month, carrier_name) %>%
  ggplot(mapping = aes(x = month, y = n, color = carrier_name)) +
  geom_line() +
  geom_point() +
  geom_dl(aes(label = carrier_name), method = list(dl.trans(x = x + .3), "last.bumpup")) + 
  scale_x_continuous(breaks = 1:12, limits = c(1,17)) + 
  labs(
    title = "Number of flights",
    y = "Flights",
    x = "Month"
  ) + 
  theme(legend.position = "none") 

Which carriers/airlines have the worst delays?

Note that the delay columns are in minutes. We first convert them to hours:

dat <- dat %>% 
  mutate(across(contains("delay"), ~ .x / 60))
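
To see what this conversion does, here is a small sketch on a made-up table (the tibble toy and its values are hypothetical). Only columns whose names contain "delay" are rescaled; all other columns are left untouched:

```r
library(dplyr)

# Hypothetical data: delays in minutes, distance in miles
toy <- tibble(
  dep_delay = c(-6, 30, 120),
  arr_delay = c(-12, 45, 90),
  distance  = c(200, 500, 1000)
)

# Divide every column whose name contains "delay" by 60 (minutes -> hours)
toy_hours <- toy %>%
  mutate(across(contains("delay"), ~ .x / 60))

toy_hours
```

Since dat is overwritten, running the conversion chunk twice would divide by 60 twice, so it should be run only once per session.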

Next, we answer the question by looking at different measures.

Average delay

Let us first have a look at the average departure delay by airline. The dplyr package has two functions that make it easy to do that: group_by and summarise. We use the two together: group_by groups the rows of the dataset by carrier, and summarise then uses the mean function to calculate the average delay for each group:

dat %>% 
  group_by(carrier_name) %>% 
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE)) %>% 
  arrange(desc(ave_delay)) %>% 
  paged_table()

Note that the mean function has an na.rm argument, which makes it ignore missing values; otherwise the average delay could not be calculated for a group that contains missing values. We can visualize our summary (a continuous-categorical comparison) by piping the table into a column plot:

dat %>% 
  group_by(carrier_name) %>% 
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(carrier_name, ave_delay)) + 
  geom_col()

To get a better visualization we reorder the categorical x-axis by average delay, use the full names of the airlines (which are rotated) and add some informative labels:

dat %>% 
  group_by(carrier_name) %>% 
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(reorder(carrier_name, ave_delay), 60 * ave_delay)) + 
  geom_col() + 
  labs(
    title = "Average departure delay for each carrier",
    x = "Carrier",
    y = "Delay (minutes)"
  ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

To conclude, Frontier (F9) and ExpressJet (EV) have the highest average delays. However, using the mean to summarize a variable can be dangerous, because it’s sensitive to outliers!
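
The sensitivity of the mean to outliers can be illustrated with a few made-up delays in minutes. The values below are hypothetical, except that the extreme value 1301 equals the largest dep_delay reported in the skim output:

```r
# Hypothetical delays in minutes: nine ordinary flights and one extreme outlier
delays <- c(-5, -3, -2, 0, 1, 2, 4, 6, 10, 1301)

mean_with    <- mean(delays)        # all ten flights
mean_without <- mean(delays[-10])   # the outlier (10th value) dropped

c(with_outlier = mean_with, without_outlier = mean_without)
```

A single flight moves the average from about 1.4 minutes to 131.4 minutes, so a carrier with a few extreme delays can look far worse on average than its typical flight suggests.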

Variation

We should always ask about the variation in the variables in our data sets, but it’s especially important to do so if we’re going to use averages to summarize them.

First let us calculate the standard deviation for each carrier:

dat %>% 
  group_by(carrier_name) %>% 
  summarise(ave_delay = mean(dep_delay, na.rm = TRUE), std = sd(dep_delay, na.rm = TRUE)) %>% 
  arrange(desc(std)) %>% 
  paged_table()

What is the distribution of departure delays by airline? We visualize it as density curves using carrier as fill aesthetic:

dat %>%
  ggplot(aes(dep_delay, fill = carrier_name)) + 
  geom_density(alpha = 0.5) + 
  labs(
    title = "Departure delay densities for each carrier",
    x = "Delay (hours)",
    y = "Density",
    fill = "Carrier"
  ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) 

We can see that there is a small number of HUGE outliers, which makes the mean potentially very misleading.

Let us try to make a plot of the empirical cumulative distributions for each carrier, using carrier as color aesthetic and zooming in on delays of at most 3 hours:

dat %>%
  ggplot() + 
  stat_ecdf(aes(x = dep_delay, color = carrier_name), alpha = 0.75) +
  coord_cartesian(xlim = c(-0.1,3)) +  
  labs(
    title = "Departure delay empirical cumulative distributions",
    x = "Delay (hours)",
    y = "Probability",
    color = "Carrier"
  ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Note that the further towards the upper-left corner a distribution lies, the better. That is, a carrier dominates another carrier if its line lies above the other carrier's line. Comparing this to the standard deviations, we see that the standard deviation is not a good measure for delays.
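
The dominance argument can be made concrete with the base R function ecdf. The sketch below uses made-up delays for two hypothetical carriers A and B; it is an illustration of the idea, not part of the original analysis:

```r
# Hypothetical departure delays (hours) for two made-up carriers
delay_a <- c(-0.1, 0.0, 0.1, 0.2, 0.5)
delay_b <- c(0.0, 0.3, 0.6, 1.0, 2.5)

# ecdf() returns a function giving the share of observations <= x
ecdf_a <- ecdf(delay_a)
ecdf_b <- ecdf(delay_b)

# Share of flights delayed at most half an hour
ecdf_a(0.5)
ecdf_b(0.5)

# A dominates B if its curve is never below B's on a grid of delays
grid <- seq(-0.1, 3, by = 0.1)
a_dominates <- all(ecdf_a(grid) >= ecdf_b(grid))
a_dominates
```

Reading the plot works the same way: for any delay on the x-axis, the carrier with the higher curve has the larger share of flights delayed by at most that much.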

Variation in data like these where the outliers are very sparse is hard to visualize using density plots. We may also use a boxplot:

dat %>%
  ggplot(aes(carrier_name, dep_delay)) + 
  geom_boxplot() + 
  labs(
    title = "Variation in departure delay for each carrier",
    x = "Carrier",
    y = "Delay (hours)"
  ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

We can see that most carriers have a median delay around zero. However, some carriers have larger delays than others. Is the variation in departure delay different given departure airport? We use departure airport as color aesthetic:

dat %>%
  ggplot(aes(carrier_name, dep_delay, color = origin_name)) + 
  geom_boxplot() + 
  labs(
    title = "Variation in departure delay",
    x = "Carrier",
    y = "Delay (hours)",
    color = "Departure airport"
  ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), 
        legend.position = "bottom")

This does not seem to be the case for most carriers.

Median

The boxplot shows median values in the center. What would happen if we used the median instead of the average delay and made a column plot?

dat %>% 
  group_by(carrier_name) %>% 
  summarise(median_delay = median(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(reorder(carrier_name, median_delay), median_delay)) + 
  geom_col() + 
  labs(
    title = "Median departure delay for each carrier",
    x = "Carrier",
    y = "Median delay (hours)"
  ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

That tells a bit of a different story! Fly SkyWest (OO) and you’ll get to leave six minutes early. Seemingly small, simple differences in the tools you choose when exploring data can lead to visualizations that tell very different stories.

Delays of more than an hour

How many flights were really delayed and how does that break down by airline carrier? Being delayed more than an hour really sucks, so let’s use that as our cutoff:

dat %>%
  filter(dep_delay > 1)
# A tibble: 26,581 × 9
   month carrier_name  origin_name dest_name  sched_dep_time dep_delay
   <int> <chr>         <chr>       <chr>               <int>     <dbl>
 1     1 Envoy Air     La Guardia  Charlotte…            630      1.68
 2     1 American Air… John F Ken… Miami Intl            715      1.18
 3     1 Envoy Air     John F Ken… Baltimore…           1835     14.2 
 4     1 United Air L… Newark Lib… General E…            733      2.4 
 5     1 United Air L… La Guardia  George Bu…            900      2.23
 6     1 ExpressJet A… Newark Lib… Savannah …            944      1.6 
 7     1 Envoy Air     La Guardia  Minneapol…           1150      1.18
 8     1 JetBlue Airw… John F Ken… Los Angel…           1220      1.28
 9     1 ExpressJet A… La Guardia  Memphis I…           1250      1.17
10     1 ExpressJet A… Newark Lib… Richmond …           1310      1.92
# … with 26,571 more rows, and 3 more variables: arr_delay <dbl>,
#   distance <dbl>, tailnum <chr>

That’s a lot of flights! We can use the dplyr function count to give us a summary of the number of rows that correspond to each carrier:

dat %>%
  filter(dep_delay > 1) %>%
  count(carrier_name, sort = TRUE)
# A tibble: 16 × 2
   carrier_name                    n
   <chr>                       <int>
 1 ExpressJet Airlines Inc.     6861
 2 JetBlue Airways              4571
 3 United Air Lines Inc.        3824
 4 Delta Air Lines Inc.         2651
 5 American Airlines Inc.       2003
 6 Envoy Air                    1996
 7 Endeavor Air Inc.            1966
 8 Southwest Airlines Co.       1061
 9 US Airways Inc.               766
10 Virgin America                363
11 AirTran Airways Corporation   314
12 Mesa Airlines Inc.             79
13 Frontier Airlines Inc.         73
14 Alaska Airlines Inc.           39
15 Hawaiian Airlines Inc.         10
16 SkyWest Airlines Inc.           4

Note that count has created a column named n which contains the counts and we ask it to sort that column for us.

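We can visualize it with a column plot. Note that count has already sorted the rows, so we only need to convert carrier_name to a factor with its levels in that order; otherwise ggplot would order the carriers alphabetically: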

dat %>%
  filter(dep_delay > 1) %>%
  count(carrier_name, sort = TRUE) %>%
  mutate(carrier_name = factor(carrier_name, levels = carrier_name, ordered = TRUE)) %>%
  ggplot(aes(carrier_name, n)) + 
  geom_col() + 
  labs(
    title = "Number of flights delayed more than one hour",
    x = "Carrier",
    y = "Flights"
  ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

It seems that ExpressJet (EV) has a problem: it has a lot of very delayed flights.
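
A raw count favors carriers with few departures, so a carrier with many flights can top the list without being unusually unreliable. A possible complement, which is not part of the original analysis, is to compute the share of flights delayed more than one hour. The sketch below uses a made-up table (carriers "A" and "B" are hypothetical):

```r
library(dplyr)

# Hypothetical delays in hours for two made-up carriers
toy <- tibble(
  carrier_name = c("A", "A", "A", "A", "B", "B"),
  dep_delay    = c(1.5, 0.1, 2.0, -0.1, 1.2, 3.0)
)

toy_share <- toy %>%
  group_by(carrier_name) %>%
  summarise(
    flights     = n(),                  # all flights of the carrier
    delayed     = sum(dep_delay > 1),   # flights delayed more than one hour
    share_delay = mean(dep_delay > 1)   # delayed / flights
  ) %>%
  arrange(desc(share_delay))

toy_share
```

Here carrier A has as many delayed flights as B, but only half of its flights are delayed, whereas all of B's flights are.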

What is the relationship between departure delay and arrival delay?

We plot the delays against each other as points.

ggplot(data = dat, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point(alpha = 0.1) + 
  labs(
    title = "Departure against arrival delay",
    x = "Departure delay (hours)",
    y = "Arrival delay (hours)"
  ) 

The large mass of points near (0, 0) can cause some confusion, since it is hard to tell the true number of points plotted there. This phenomenon is called overplotting: points are plotted on top of each other over and over again. To reduce the problem, we make the points transparent by setting alpha = 0.1.

As we can see, there is a roughly linear relationship: departure delays result in arrival delays, as expected. In general, we cannot fly faster to catch up with the delay.
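
The strength of such a linear relationship can be quantified with a correlation coefficient and a fitted line. The sketch below uses made-up delays in hours (dep and arr are hypothetical vectors, not taken from the dataset):

```r
# Hypothetical delays in hours for six flights
dep <- c(-0.1, 0.0, 0.5, 1.0, 2.0, 4.0)
arr <- c(-0.3, -0.1, 0.4, 0.8, 1.9, 3.9)

# Pearson correlation between departure and arrival delay
r <- cor(dep, arr)

# Least-squares line: arr = intercept + slope * dep
fit <- lm(arr ~ dep)

r
coef(fit)
```

A slope close to one means that an extra hour of departure delay gives roughly an extra hour of arrival delay, that is, the delay is not recovered in the air. On the real data one would use dat$dep_delay and dat$arr_delay after removing rows with missing arr_delay.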

Are flight delays worse at different New York airports?

If you’re flying out of New York you might want to know which airport has the worst delays on average. We first calculate median and average delays:

dat %>% 
  group_by(origin_name) %>% 
  summarize(ave_delay = mean(dep_delay, na.rm = TRUE), median_delay = median(dep_delay, na.rm = TRUE))
# A tibble: 3 × 3
  origin_name         ave_delay median_delay
  <chr>                   <dbl>        <dbl>
1 John F Kennedy Intl     0.202      -0.0167
2 La Guardia              0.172      -0.05  
3 Newark Liberty Intl     0.252      -0.0167

As we can see, La Guardia seems to have the smallest delays. However, the difference is small. Let us try to make a plot of the empirical cumulative distributions for each airport, using airport as color aesthetic and zooming in on delays of at most 2 hours:

dat %>%
  ggplot() + 
  stat_ecdf(aes(x = dep_delay, color = origin_name), alpha = 0.75) +
  coord_cartesian(xlim = c(-0.1,2)) +  
  labs(
    title = "Departure delay empirical cumulative distributions",
    x = "Delay (hours)",
    y = "Probability",
    color = "Departure airport"
  ) + 
  theme(legend.position = "bottom")

The median values can be found at y = 0.5. Note that the La Guardia line lies above the other lines, indicating that it has the smallest delays no matter which fractile we consider. Another way to visualize this covariation between a categorical (airport) and a continuous (delay) variable is with a boxplot. We zoom so that the delays shown are at most half an hour:

dat %>%
  ggplot(aes(origin_name, dep_delay)) + 
  geom_boxplot() + 
  coord_cartesian(ylim = c(-0.1, 0.5)) + 
  labs(
    title = "Departure delay",
    x = "Airport",
    y = "Delay (hours)"
  ) 

Are carrier flight delays different at New York airports?

We first calculate median and average delays:

dat %>% 
  group_by(carrier_name, origin_name) %>% 
  summarize(ave_delay = mean(dep_delay, na.rm = TRUE), median_delay = median(dep_delay, na.rm = TRUE)) %>% 
  paged_table()

There are some differences. Let us try to do a heat map of the average delays:

dat %>%
  group_by(origin_name, carrier_name) %>%
  summarize(ave_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(origin_name, carrier_name, fill = 60*ave_delay)) + 
  geom_tile() + 
  scale_fill_continuous(low = "#31a354", high = "#e5f5e0") + 
  labs(
    title = "Average departure delays",
    x = "Departure airport",
    y = "Carrier",
    fill = "Ave. delay (min)"
  ) 

For each carrier this gives good insight into the differences between airports. Another way to visualize the covariation is with a box plot. We scale the delays to minutes and zoom so that the delays shown are at most half an hour:

dat %>%
  ggplot(aes(carrier_name, 60*dep_delay, fill = origin_name)) + 
  geom_boxplot() + 
  coord_cartesian(ylim = c(-10, 30)) + 
  labs(
    title = "Departure delay",
    x = "Carrier",
    y = "Delay (min)",
    fill = "Departure airport"
  ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

We may also try to plot the empirical cumulative distributions for each carrier (facet), using airport as color aesthetic and zooming in on delays of at most 1 hour:

dat %>%
  ggplot() + 
  stat_ecdf(aes(x = dep_delay, color = origin_name), alpha = 0.75) +
  coord_cartesian(xlim = c(-0.1, 1)) +  
  facet_wrap(vars(carrier_name)) +
  labs(
    title = "Departure delay empirical cumulative distributions",
    x = "Delay (hours)",
    y = "Probability",
    color = "Departure airport"
  ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "bottom")

Does departure time affect flight delays?

First, note that sched_dep_time is a number in the format HHMM. We convert it into an hours:minutes data type and afterwards to hours since midnight:

dat <- dat %>% 
  mutate(sched_dep_time = hm(str_replace(sched_dep_time, "^(.*)(..)$", "\\1:\\2"))) %>% 
  mutate(sched_dep_time = as.numeric(sched_dep_time)/60/60)
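
The same conversion can also be done with integer arithmetic, which avoids string handling. This is an alternative sketch, not the method used above, and the helper name hhmm_to_hours is hypothetical. It also handles times before 01:00 (e.g. 30 for 00:30), which the string pattern above would turn into ":30":

```r
# Convert a number in HHMM format to hours since midnight
hhmm_to_hours <- function(hhmm) {
  hours   <- hhmm %/% 100   # integer division gives HH
  minutes <- hhmm %% 100    # remainder gives MM
  hours + minutes / 60
}

hhmm_to_hours(c(515, 1200, 30))
```

In the pipeline this would correspond to mutate(sched_dep_time = hhmm_to_hours(sched_dep_time)), applied to the original integer column.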

To explore covariation in two continuous (quantitative) variables, we can use a scatter plot:

dat %>%
  ggplot(aes(sched_dep_time, dep_delay, color = origin_name)) + 
  geom_point(alpha = 0.1) + 
  # geom_smooth() +  
  labs(
    title = "Departure delay given departure time",
    y = "Delay (hours)",
    x = "Departure time (hours after midnight) ",
    color = "Departure airport"
  ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "bottom")

Based on the plot there does not seem to be a clear effect for the different airports.

Does travel distance affect departure and arrival delay?

We use the patchwork package to plot distance against the two delays:

p1 <- dat %>% 
  ggplot(aes(x=distance, y= dep_delay)) + 
  geom_point(alpha = 0.1) +
  geom_smooth() +  
  labs(
    y = "Dept. delay (hours)",
    x = "Distance"
  ) 

p2 <- dat %>% 
  ggplot(aes(x=distance, y= arr_delay)) + 
  geom_point(alpha = 0.1) +
  geom_smooth() +  
  labs(
    y = "Arrival delay (hours)",
    x = "Distance"
  ) 

p1 + p2

Based on the plots, there does not seem to be a clear effect of distance on either delay.

Planes

Let us do a mutating join so we have a bit more information about each airplane:

dat <- dat %>% 
  left_join(planes %>% 
              select(tailnum, plane_manufacturer = manufacturer, plane_model = model))

Find the monthly usage of all the aircraft

This could be useful for some kind of maintenance activity that needs to be done after x number of trips. The summary table is (based on tailnum):

dat %>% 
  count(tailnum, month) %>% 
  paged_table()

As an example, consider the plane N355NB:

dat1 <- dat %>% 
  filter(tailnum=="N355NB") 

The specifications are:

filter(planes, tailnum=="N355NB")
# A tibble: 1 × 9
  tailnum  year type     manufacturer model engines seats speed engine
  <chr>   <int> <chr>    <chr>        <chr>   <int> <int> <int> <chr> 
1 N355NB   2002 Fixed w… AIRBUS       A319…       2   145    NA Turbo…

We see that it is an Airbus A319 with 145 seats. The plane flew 124 flights in 2013 with a total distance of 1.03089 × 10^5 (about 103,089) miles.
Let us have a look at the destinations:

dat1 %>% 
  count(dest_name) %>% 
  ggplot(aes(x = reorder(dest_name, -n), y = n)) +
  geom_col()
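
The flight count and total distance quoted above for N355NB are not accompanied by code in this rendering. A minimal sketch of how such numbers could be computed, shown on made-up rows (the tail number "N000XX" and the distances are hypothetical):

```r
library(dplyr)

# Hypothetical flights of a single made-up plane
toy_plane <- tibble(
  tailnum  = c("N000XX", "N000XX", "N000XX"),
  distance = c(500, 750, 1000)   # miles
)

toy_usage <- toy_plane %>%
  group_by(tailnum) %>%
  summarise(
    flights        = n(),            # number of rows = number of flights
    total_distance = sum(distance)   # total miles flown
  )

toy_usage
```

Applying the same summarise call to dat1 would give the two numbers for the example plane.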

Weather

In this section we will focus on the weather data set, which lists hourly meteorological data for LGA, JFK and EWR. We run skim to get an overview:

skim(weather)
Table 2: Data summary
Name weather
Number of rows 26115
Number of columns 15
_______________________
Column type frequency:
character 1
numeric 13
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
origin 0 1 3 3 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2013.00 0.00 2013.00 2013.00 2013.00 2013.00 2013.00 ▁▁▇▁▁
month 0 1.00 6.50 3.44 1.00 4.00 7.00 9.00 12.00 ▇▆▆▆▇
day 0 1.00 15.68 8.76 1.00 8.00 16.00 23.00 31.00 ▇▇▇▇▆
hour 0 1.00 11.49 6.91 0.00 6.00 11.00 17.00 23.00 ▇▇▆▇▇
temp 1 1.00 55.26 17.79 10.94 39.92 55.40 69.98 100.04 ▂▇▇▇▁
dewp 1 1.00 41.44 19.39 -9.94 26.06 42.08 57.92 78.08 ▁▆▇▇▆
humid 1 1.00 62.53 19.40 12.74 47.05 61.79 78.79 100.00 ▁▆▇▇▆
wind_dir 460 0.98 199.76 107.31 0.00 120.00 220.00 290.00 360.00 ▆▂▆▇▇
wind_speed 4 1.00 10.52 8.54 0.00 6.90 10.36 13.81 1048.36 ▇▁▁▁▁
wind_gust 20778 0.20 25.49 5.95 16.11 20.71 24.17 28.77 66.75 ▇▅▁▁▁
precip 0 1.00 0.00 0.03 0.00 0.00 0.00 0.00 1.21 ▇▁▁▁▁
pressure 2729 0.90 1017.90 7.42 983.80 1012.90 1017.60 1023.00 1042.10 ▁▁▇▆▁
visib 0 1.00 9.26 2.06 0.00 10.00 10.00 10.00 10.00 ▁▁▁▁▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
time_hour 0 1 2013-01-01 01:00:00 2013-12-30 18:00:00 2013-07-01 14:00:00 8714

For further details see View(weather) or read the associated help file by running ?weather.

Observe that there is a variable called temp with hourly temperature recordings in Fahrenheit at weather stations near all three major airports in New York City: Newark (origin code EWR), John F. Kennedy International (JFK), and LaGuardia (LGA). Let us transform the temperature to Celsius:

dat_w <- weather %>% 
  left_join(airports %>% select(faa, name), 
            by = c("origin" = "faa")) %>% 
  rename(origin_name = name) %>% 
  mutate(temp = (temp - 32) * (5/9) ) %>% 
  select(origin_name, time_hour, month, temp)
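
The conversion formula can be checked on known reference points. The helper name f_to_c below is hypothetical; the last two inputs are the minimum and maximum of temp from the skim output:

```r
# Fahrenheit to Celsius, the same formula as in the mutate call above
f_to_c <- function(temp_f) (temp_f - 32) * 5 / 9

# Freezing and boiling point of water, then the observed min and max of temp
f_to_c(c(32, 212, 10.94, 100.04))
```

The observed range of 10.94 to 100.04 degrees Fahrenheit thus corresponds to about -11.7 to 37.8 degrees Celsius.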

How is the temperature fluctuating over the year?

We start by plotting temperature over the year with airport/origin as color aesthetic. We also add a smoothing line:

dat_w %>% 
  ggplot(mapping = aes(x = time_hour, y = temp, color = origin_name)) +
  geom_line(alpha = 0.2) + 
  geom_smooth(alpha = 0.25)

Note that we have used the alpha argument to make the lines more transparent. There are many fluctuations; however, the temperature cycle from winter to summer is clear. Moreover, JFK seems to have an outlier in May (approx. -10 degrees Celsius), probably due to a faulty measurement.

Are the temperatures different at the airports?

Let us start by plotting the density for each airport:

dat_w %>% 
  ggplot(mapping = aes(x = temp, fill = origin_name)) +
  geom_density(alpha=0.75) +
  geom_vline(
    data = dat_w %>% 
       group_by(origin_name) %>% 
       summarise(m = mean(temp, na.rm = TRUE)), 
    mapping = aes(xintercept = m, color = origin_name)
  )

Note that the mean temperature is more or less the same at the three airports (vertical lines). The temperature fluctuates a bit more at Newark than at, for instance, JFK (lowest spread).

A closer look can be done by faceting by month:

dat_w %>% 
  ggplot(mapping = aes(x = temp, fill = origin_name)) +
  geom_density(alpha=0.5) + 
  facet_wrap(vars(month))

Finally, let us consider a boxplot of temperature for each month:

ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
  geom_boxplot()

The resulting plot shows 12 separate boxplots side by side and illustrates the variability and fluctuations over the year.

What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.

Canceled flights

The canceled flights are:

dat_c <- flights %>%
  filter(is.na(dep_time)) 

Give comments based on your intuition. Is the analysis valid?

A few examples of analyses:

We first get the full name of carriers by joining the canceled flights table and airlines table.

dat_c <- dat_c %>% 
  left_join(airlines) %>% 
  mutate(sched_dep_time = hm(str_replace(sched_dep_time, "^(.*)(..)$", "\\1:\\2"))) %>% 
  mutate(sched_dep_time = as.numeric(sched_dep_time)/60/60) %>% 
  print()
# A tibble: 8,255 × 20
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <dbl>     <dbl>    <int>
 1  2013     1     1       NA           16.5        NA       NA
 2  2013     1     1       NA           19.6        NA       NA
 3  2013     1     1       NA           15          NA       NA
 4  2013     1     1       NA            6          NA       NA
 5  2013     1     2       NA           15.7        NA       NA
 6  2013     1     2       NA           16.3        NA       NA
 7  2013     1     2       NA           13.9        NA       NA
 8  2013     1     2       NA           14.3        NA       NA
 9  2013     1     2       NA           13.4        NA       NA
10  2013     1     2       NA           15.8        NA       NA
# … with 8,245 more rows, and 13 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, name <chr>
colnames(dat_c)[colnames(dat_c) == "name"] <-  "carrier_name"

Number of canceled flights in each month:

dat_c %>%
  ggplot(aes(factor(month))) + 
  geom_bar() + 
  labs(
    title = "Number of canceled flights",
    x = "Month",
    y = "Flights"
  ) 

Number of canceled flights over the day:

dat_c %>%
  ggplot(aes(sched_dep_time)) + 
  geom_histogram(binwidth = 2) + 
  labs(
    title = "Number of canceled flights",
    x = "Hour",
    y = "Flights"
  ) 

Number of canceled flights per carrier:

dat_c %>%
  count(carrier_name) %>% 
  ggplot(aes(x = reorder(carrier_name, n), y = n)) + 
  geom_col() + 
  labs(
    title = "Number of canceled flights",
    x = "Carrier",
    y = "Flights"
  ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Number of canceled flights per carrier from each origin airport:

dat_c %>% 
  ggplot(aes(carrier_name)) +
  geom_bar() + 
  facet_grid(rows = vars(origin)) + 
  labs(
    title = "Number of canceled flights from each departure airport",
    x = "Carrier",
    y = "Flights"
  ) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Other comments

No solution is provided here. Give comments based on your intuition. Is the analysis valid?

Colophon

This report has been created inside RStudio using R Markdown and the distill format.

The report was built using:

 setting  value                       
 version  R version 4.1.1 (2021-08-10)
 os       macOS Big Sur 10.16         
 system   x86_64, darwin19.6.0        
 ui       unknown                     
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Europe/Copenhagen           
 date     2021-09-23                  


Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".